Introduction

Have you recently mastered the introductory concepts of statistical modeling? Congratulations! With that foundation in place, it’s time to consider tuning model parameters further in order to improve on the ordinary least squares (OLS) approach. One way to do this is to determine whether all of the variables in our model are important, or whether we can produce a better model by reducing the number of variables through feature selection. You were likely introduced to subset selection in an introductory statistical modeling course, but there are other methods for improving OLS with feature selection as well. We will cover the following techniques in this tutorial, discussing each method’s motivation and algorithm:

  1. Subset Selection
    • Best Subsets
    • Stepwise Selection/Elimination
  2. Penalized Regression
    • Ridge
    • Lasso
    • Elastic Net
  3. Dimension Reduction
    • Principal Component Analysis (PCA)
    • Partial Least Squares (PLS)

In addition, we will demonstrate these techniques through coding examples using an online news popularity dataset.

Before we dive into these methods, let’s refresh a few key concepts.

Review:

Supervised vs Unsupervised Learning: supervised learning involves modeling a response variable as a function of one or more features, while unsupervised learning involves finding underlying structure in the data features without reference to a response variable.

Response Variable: the variable of interest in a prediction problem, usually denoted \(Y\).

Features/Predictor Variables: data used in modeling, usually denoted \(X\). In a supervised learning problem, these features are used to predict the response variable.

Standardizing Variables: re-scaling variables to have mean zero and variance one. Standardizing can be especially important when variables are on widely differing scales (e.g. one feature measured as a proportion between zero and one and another feature measured in dollars on a scale from zero to several thousand).
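As a quick sketch (using R’s built-in `mtcars` data rather than the tutorial’s dataset), the base function `scale()` performs this re-scaling:

```r
# Standardize horsepower from the built-in mtcars data:
# subtract the mean, then divide by the standard deviation
hp_std <- as.vector(scale(mtcars$hp))

mean(hp_std)  # effectively 0
sd(hp_std)    # 1
```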

Bias & Variance: bias is a measure of how far the mean/expected value of an estimator is from the true value, or \(E(\hat Y) - Y\). Variance is a measure of how much the estimator varies around its own mean, or \(E(\hat Y^2) - E(\hat Y)^2\). In other words, bias measures the error in the center of the estimate, and variance measures the spread of the estimate. Generally, as model complexity increases, bias decreases and variance increases. This is frequently referred to as the bias-variance tradeoff.
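Using the same notation, and treating the target \(Y\) as a fixed true value observed with noise of variance \(\sigma^2\), the two quantities combine in the standard decomposition of expected squared prediction error:

\[
E\big[(Y - \hat Y)^2\big] = \underbrace{\big(E(\hat Y) - Y\big)^2}_{\text{bias}^2} + \underbrace{E(\hat Y^2) - E(\hat Y)^2}_{\text{variance}} + \sigma^2
\]

so lowering one term at the cost of raising the other only helps if the total decreases.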

Model Selection Criteria: Model selection criteria are used to evaluate and compare subset regression models. Common criteria used include:

  • Adjusted \(R^2\) = \(1-(\frac{n-1}{n-p})(1-R^2_p)\)
  • Akaike Information Criterion (AIC) = \(-2\ln(L)+2p\), where \(L\) is the likelihood function for a specific model
  • Bayesian Information Criterion (BIC) = \(-2\ln(L)+p\ln(n)\)
  • Mean Square Error (MSE) = \(\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2\)
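As a sketch of how these criteria can be obtained in R (shown on the built-in `mtcars` data, not the tutorial’s dataset), `AIC()`, `BIC()`, and `summary()` report them directly for a fitted `lm` model; note that R counts the error variance as an estimated parameter, so the information criteria can differ by a constant from hand calculations that count only regression coefficients:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

summary(fit)$adj.r.squared  # adjusted R^2
AIC(fit)                    # Akaike Information Criterion
BIC(fit)                    # Bayesian Information Criterion
mean(residuals(fit)^2)      # (training) mean squared error
```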

Why improve on least squares?

OLS is not well suited to high-dimensional data. When the number of predictors \(p\) is greater than or equal to the number of observations \(n\), it is possible to create a model explaining 100% of the variance in the training data, which usually leads to overfitting and poor prediction on new data.
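A tiny toy illustration of this point (random data, so the predictors are pure noise by construction): with as many estimated parameters as observations, OLS fits the training data perfectly even though the predictors carry no information about the response.

```r
set.seed(1)
n <- 10
X <- matrix(rnorm(n * (n - 1)), nrow = n)  # 9 noise predictors (+ intercept = 10 parameters)
y <- rnorm(n)                              # response unrelated to X

fit <- lm(y ~ X)
summary(fit)$r.squared                     # effectively 1: a "perfect" fit that cannot generalize
```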

Real Life Applications:

  • DNA Sequencing
  • GDP
  • Housing prices

Subset Selection

One method of improving ordinary least squares regression is to select a subset of the predictor variables based on how the resulting model performs. Having fewer predictor variables makes the model easier to interpret, because it is easier to see the relationship between the response variable and the predictors. Fewer variables also decrease the risk of overfitting the model to the training data. The predictor variables are selected based on their OLS performance, and only the predictors that improve performance are kept. The two main processes for choosing which variables to include are best subsets and stepwise regression.

Best Subsets

Best subsets tests all possible combinations of the predictors to determine which model performs best. It returns the best model for each number of predictors included: the best model with one predictor, with two predictors, and so on. The best model is typically defined as the one minimizing MSE (equivalently, the residual sum of squares), but any of the model selection criteria reviewed above may be used. Different criteria may select different combinations of predictors as “best”, so it is valuable to consider multiple options. While this process is simple to implement, the time to evaluate all combinations grows exponentially with the number of potential predictors: a model with 10 predictors has \(2^{10} = 1024\) possible subsets. Therefore, best subsets may only be feasible when the data includes a small number of variables. It is possible to cap the maximum number of predictors considered, but the computation may still take too long to be a viable solution. The user must also review the best model for each number of predictors and choose among them based on the results and the context of the problem.
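Best subsets is commonly run in R with the `leaps` package (a sketch on the built-in `mtcars` data; `leaps` must be installed separately and is not used elsewhere in this tutorial):

```r
library(leaps)

# Best model of each size, up to 5 predictors
best <- regsubsets(mpg ~ ., data = mtcars, nvmax = 5)
bs <- summary(best)

bs$adjr2                          # adjusted R^2 of the best 1-, 2-, ..., 5-predictor models
coef(best, which.max(bs$adjr2))   # coefficients of the model maximizing adjusted R^2
```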

Stepwise-type Procedure

Stepwise-type procedures determine which predictor variables to include by adding or deleting predictors one at a time; the process evaluates fewer models and is less computationally burdensome than comparing all possible subsets. Three categories of stepwise-type procedures include forward selection, backward elimination, and stepwise regression (which is a combination of the previous two).

Forward Selection: The initial model includes only the intercept and no predictors. Predictors are added to the model one at a time, as long as the F statistic exceeds a preselected cutoff value called \(F_{IN}\). Most programs use a default \(F_{IN}\), so it does not need to be set explicitly in order to perform the calculation. The first predictor added is the one with the largest correlation with the response variable; this predictor also generates the most significant F statistic, which must exceed \(F_{IN}\) for the predictor to enter the model. At the second step, the candidate predictor is the one with the highest correlation with the response given the presence of the first predictor, and its partial F (or t) statistic must also exceed \(F_{IN}\). If it does not, the predictor is not added and no further predictors are considered. This step is repeated until the partial F statistic of the next candidate falls below \(F_{IN}\) or all predictors have been added to the model.

Backward Elimination: The initial model for backward elimination begins with all of the predictor variables included in the model and drops them one by one until a suitable model is found. This method is preferred if you want to confirm all predictors have been considered and nothing was missed. Backward elimination determines if predictors should be dropped by comparing the partial F (or t) statistic to a preselected value \(F_{OUT}\). Similar to \(F_{IN}\), programs will use a default value for \(F_{OUT}\). The first step of implementing backward elimination is to calculate the partial F statistic for all the predictors. The smallest partial F statistic is compared to \(F_{OUT}\) and if it is smaller than \(F_{OUT}\), then that predictor is removed from the model. The partial F statistics are recalculated given the updated model, and the smallest partial F statistic for the new model is compared to \(F_{OUT}\). The process of removing predictors will continue until the smallest partial F statistic exceeds \(F_{OUT}\).

Stepwise Regression: Stepwise regression is a combination of forward selection and backward elimination. The initial model assumes only the intercept is included, and predictors are added one at a time. The difference is after a predictor is added, all the predictors in the model will be reviewed to determine if they should still be included. In this method, both \(F_{IN}\) and \(F_{OUT}\) must be preselected. There is no required relationship between \(F_{IN}\) and \(F_{OUT}\). Some people prefer to set the two values equal to each other; others would rather set \(F_{IN}\) > \(F_{OUT}\) so it is more difficult to add a predictor than drop one.

The initial model is the same as in forward selection, and each predictor added must meet the same criteria as outlined there. After adding a new predictor, the partial F statistic is calculated for each predictor in the model and compared to \(F_{OUT}\). One important difference is that every predictor whose partial F statistic is less than \(F_{OUT}\) is dropped, whereas backward elimination drops only the predictor with the smallest partial F statistic at each step. The model is complete when the partial F statistic for adding a predictor is less than \(F_{IN}\).
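The partial F statistics described above can be inspected one step at a time with base R’s `add1()` and `drop1()` (a sketch on the built-in `mtcars` data; note that `step()`, used in the next section, automates this loop but compares models by AIC rather than F cutoffs):

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Partial F test for each candidate predictor, given that wt is in the model
add1(fit, scope = ~ wt + hp + qsec, test = "F")

# Partial F test for each predictor currently in the model
drop1(fit, test = "F")
```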

Code Implementation

We will show how to implement the various methods using an online news popularity data set. The data set includes 58 predictors and 39,644 rows of observations. Examples of the predictors include the number of words in the title, the number of words in the article, a binary indicator for each day of the week, and the average polarity of the words used. The two non-predictive columns we removed are url and timedelta (the time between the article being published and being added to the dataset). The goal is to determine which of the 58 predictors are most useful for predicting, via linear regression, the number of times an article is shared.

The code below shows how to perform all three of the stepwise procedures. The output in R typically shows detailed steps of the different models, but only the summaries have been included here.

# Load the data (dplyr provides %>% and select())
library(dplyr)
news = read.csv('OnlineNewsPopularity.csv') %>% select(-url, -timedelta)

# Define intercept only model
regnull <- lm(shares~1, data=news)
# Define model with all predictors
regfull <- lm(shares~., data=news)
# Perform Forward Selection, start with regnull
forward_selection = step(regnull, scope=list(lower=regnull, upper=regfull), direction="forward")
# Perform Backward Elimination, start with regfull
backward_elimination = step(regfull, scope=list(lower=regnull, upper=regfull), direction="backward")
# Perform Stepwise Regression, start with regnull
stepwise_regression = step(regnull, scope=list(lower=regnull, upper=regfull), direction="both")
# Results from Forward Selection
summary(forward_selection)
## 
## Call:
## lm(formula = shares ~ kw_avg_avg + self_reference_min_shares + 
##     kw_max_avg + kw_min_avg + num_hrefs + kw_max_max + avg_negative_polarity + 
##     data_channel_is_entertainment + LDA_03 + average_token_length + 
##     global_subjectivity + n_tokens_title + weekday_is_monday + 
##     kw_min_max + LDA_02 + weekday_is_saturday + num_self_hrefs + 
##     n_tokens_content + self_reference_max_shares + kw_max_min + 
##     data_channel_is_lifestyle + num_keywords + global_rate_positive_words + 
##     min_positive_polarity + abs_title_sentiment_polarity + abs_title_subjectivity + 
##     num_imgs + kw_min_min, data = news)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34610  -2243  -1232   -127 837695 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -1.414e+03  7.193e+02  -1.965 0.049368 *  
## kw_avg_avg                     1.646e+00  1.268e-01  12.979  < 2e-16 ***
## self_reference_min_shares      2.261e-02  3.366e-03   6.718 1.87e-11 ***
## kw_max_avg                    -2.001e-01  2.354e-02  -8.502  < 2e-16 ***
## kw_min_avg                    -3.600e-01  7.351e-02  -4.897 9.77e-07 ***
## num_hrefs                      2.719e+01  6.408e+00   4.243 2.21e-05 ***
## kw_max_max                    -6.776e-04  5.377e-04  -1.260 0.207663    
## avg_negative_polarity         -1.485e+03  5.215e+02  -2.848 0.004406 ** 
## data_channel_is_entertainment -9.108e+02  1.643e+02  -5.544 2.98e-08 ***
## LDA_03                         5.475e+02  2.595e+02   2.109 0.034908 *  
## average_token_length          -2.908e+02  9.627e+01  -3.020 0.002526 ** 
## global_subjectivity            2.441e+03  7.549e+02   3.234 0.001223 ** 
## n_tokens_title                 8.965e+01  2.833e+01   3.164 0.001555 ** 
## weekday_is_monday              4.636e+02  1.557e+02   2.977 0.002917 ** 
## kw_min_max                    -2.398e-03  1.092e-03  -2.197 0.028047 *  
## LDA_02                        -8.297e+02  2.468e+02  -3.362 0.000776 ***
## weekday_is_saturday            5.886e+02  2.424e+02   2.428 0.015181 *  
## num_self_hrefs                -5.295e+01  1.725e+01  -3.069 0.002149 ** 
## n_tokens_content               1.934e-01  1.537e-01   1.259 0.208126    
## self_reference_max_shares      3.641e-03  1.660e-03   2.193 0.028317 *  
## kw_max_min                     3.506e-02  1.890e-02   1.855 0.063632 .  
## data_channel_is_lifestyle     -5.460e+02  2.684e+02  -2.034 0.041951 *  
## num_keywords                   6.333e+01  3.356e+01   1.887 0.059158 .  
## global_rate_positive_words    -7.843e+03  4.160e+03  -1.886 0.059366 .  
## min_positive_polarity         -1.848e+03  9.263e+02  -1.995 0.046006 *  
## abs_title_sentiment_polarity   5.840e+02  2.841e+02   2.055 0.039849 *  
## abs_title_subjectivity         6.144e+02  3.419e+02   1.797 0.072362 .  
## num_imgs                       1.153e+01  7.999e+00   1.442 0.149423    
## kw_min_min                     2.287e+00  1.613e+00   1.418 0.156278    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11500 on 39615 degrees of freedom
## Multiple R-squared:  0.02257,    Adjusted R-squared:  0.02188 
## F-statistic: 32.67 on 28 and 39615 DF,  p-value: < 2.2e-16
# Results from Backward Elimination
summary(backward_elimination)
## 
## Call:
## lm(formula = shares ~ n_tokens_title + n_tokens_content + n_unique_tokens + 
##     n_non_stop_words + num_hrefs + num_self_hrefs + num_imgs + 
##     average_token_length + num_keywords + data_channel_is_lifestyle + 
##     data_channel_is_entertainment + data_channel_is_bus + data_channel_is_socmed + 
##     data_channel_is_tech + data_channel_is_world + kw_min_min + 
##     kw_max_min + kw_min_max + kw_min_avg + kw_max_avg + kw_avg_avg + 
##     self_reference_min_shares + self_reference_max_shares + weekday_is_monday + 
##     weekday_is_saturday + LDA_00 + LDA_03 + LDA_04 + global_subjectivity + 
##     global_rate_positive_words + min_positive_polarity + avg_negative_polarity + 
##     abs_title_subjectivity + abs_title_sentiment_polarity, data = news)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33664  -2239  -1225   -125 837641 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -1.792e+03  6.872e+02  -2.608 0.009113 ** 
## n_tokens_title                 9.059e+01  2.838e+01   3.192 0.001413 ** 
## n_tokens_content               5.146e-01  2.027e-01   2.539 0.011135 *  
## n_unique_tokens                2.197e+03  9.173e+02   2.395 0.016626 *  
## n_non_stop_words              -1.474e+03  6.175e+02  -2.386 0.017023 *  
## num_hrefs                      2.630e+01  6.440e+00   4.083 4.46e-05 ***
## num_self_hrefs                -5.313e+01  1.750e+01  -3.036 0.002402 ** 
## num_imgs                       1.441e+01  8.198e+00   1.757 0.078856 .  
## average_token_length          -2.689e+02  9.748e+01  -2.759 0.005800 ** 
## num_keywords                   6.037e+01  3.441e+01   1.754 0.079357 .  
## data_channel_is_lifestyle     -1.113e+03  3.774e+02  -2.948 0.003199 ** 
## data_channel_is_entertainment -1.114e+03  2.430e+02  -4.584 4.58e-06 ***
## data_channel_is_bus           -9.359e+02  3.701e+02  -2.529 0.011446 *  
## data_channel_is_socmed        -6.709e+02  3.510e+02  -1.911 0.055974 .  
## data_channel_is_tech          -6.443e+02  3.520e+02  -1.830 0.067183 .  
## data_channel_is_world         -7.234e+02  3.006e+02  -2.407 0.016092 *  
## kw_min_min                     3.825e+00  8.895e-01   4.300 1.72e-05 ***
## kw_max_min                     3.567e-02  1.891e-02   1.886 0.059313 .  
## kw_min_max                    -2.346e-03  1.093e-03  -2.147 0.031806 *  
## kw_min_avg                    -3.462e-01  7.405e-02  -4.676 2.94e-06 ***
## kw_max_avg                    -1.893e-01  2.381e-02  -7.952 1.88e-15 ***
## kw_avg_avg                     1.566e+00  1.293e-01  12.112  < 2e-16 ***
## self_reference_min_shares      2.255e-02  3.366e-03   6.698 2.15e-11 ***
## self_reference_max_shares      3.642e-03  1.661e-03   2.193 0.028322 *  
## weekday_is_monday              4.741e+02  1.557e+02   3.044 0.002334 ** 
## weekday_is_saturday            5.908e+02  2.426e+02   2.436 0.014870 *  
## LDA_00                         1.093e+03  4.435e+02   2.464 0.013751 *  
## LDA_03                         5.936e+02  3.245e+02   1.829 0.067412 .  
## LDA_04                         7.150e+02  4.024e+02   1.777 0.075603 .  
## global_subjectivity            2.550e+03  7.603e+02   3.355 0.000796 ***
## global_rate_positive_words    -8.893e+03  4.210e+03  -2.112 0.034669 *  
## min_positive_polarity         -2.256e+03  9.466e+02  -2.383 0.017187 *  
## avg_negative_polarity         -1.444e+03  5.247e+02  -2.752 0.005932 ** 
## abs_title_subjectivity         6.309e+02  3.421e+02   1.844 0.065150 .  
## abs_title_sentiment_polarity   6.003e+02  2.842e+02   2.112 0.034675 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11500 on 39609 degrees of freedom
## Multiple R-squared:  0.02279,    Adjusted R-squared:  0.02195 
## F-statistic: 27.17 on 34 and 39609 DF,  p-value: < 2.2e-16
# Results from Stepwise Regression
summary(stepwise_regression)
## 
## Call:
## lm(formula = shares ~ kw_avg_avg + self_reference_min_shares + 
##     kw_max_avg + kw_min_avg + num_hrefs + avg_negative_polarity + 
##     data_channel_is_entertainment + LDA_03 + average_token_length + 
##     global_subjectivity + n_tokens_title + weekday_is_monday + 
##     kw_min_max + LDA_02 + weekday_is_saturday + num_self_hrefs + 
##     self_reference_max_shares + kw_max_min + data_channel_is_lifestyle + 
##     num_keywords + global_rate_positive_words + min_positive_polarity + 
##     abs_title_sentiment_polarity + abs_title_subjectivity + num_imgs + 
##     kw_min_min, data = news)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34139  -2236  -1237   -130 837744 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -1.836e+03  6.182e+02  -2.970 0.002980 ** 
## kw_avg_avg                     1.605e+00  1.243e-01  12.918  < 2e-16 ***
## self_reference_min_shares      2.253e-02  3.365e-03   6.695 2.19e-11 ***
## kw_max_avg                    -1.950e-01  2.331e-02  -8.363  < 2e-16 ***
## kw_min_avg                    -3.528e-01  7.339e-02  -4.807 1.54e-06 ***
## num_hrefs                      2.938e+01  6.148e+00   4.779 1.77e-06 ***
## avg_negative_polarity         -1.524e+03  5.206e+02  -2.928 0.003413 ** 
## data_channel_is_entertainment -9.021e+02  1.637e+02  -5.511 3.59e-08 ***
## LDA_03                         5.338e+02  2.555e+02   2.089 0.036700 *  
## average_token_length          -2.917e+02  9.625e+01  -3.031 0.002440 ** 
## global_subjectivity            2.490e+03  7.542e+02   3.301 0.000963 ***
## n_tokens_title                 8.926e+01  2.827e+01   3.158 0.001592 ** 
## weekday_is_monday              4.640e+02  1.557e+02   2.979 0.002890 ** 
## kw_min_max                    -2.361e-03  1.092e-03  -2.163 0.030528 *  
## LDA_02                        -8.253e+02  2.458e+02  -3.357 0.000789 ***
## weekday_is_saturday            5.884e+02  2.424e+02   2.427 0.015210 *  
## num_self_hrefs                -5.039e+01  1.716e+01  -2.937 0.003319 ** 
## self_reference_max_shares      3.645e-03  1.660e-03   2.196 0.028128 *  
## kw_max_min                     3.580e-02  1.890e-02   1.895 0.058146 .  
## data_channel_is_lifestyle     -5.219e+02  2.681e+02  -1.947 0.051548 .  
## num_keywords                   6.242e+01  3.354e+01   1.861 0.062774 .  
## global_rate_positive_words    -7.299e+03  4.147e+03  -1.760 0.078396 .  
## min_positive_polarity         -2.097e+03  9.018e+02  -2.325 0.020071 *  
## abs_title_sentiment_polarity   5.902e+02  2.841e+02   2.078 0.037761 *  
## abs_title_subjectivity         6.279e+02  3.418e+02   1.837 0.066231 .  
## num_imgs                       1.429e+01  7.681e+00   1.861 0.062787 .  
## kw_min_min                     3.932e+00  8.828e-01   4.454 8.46e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11500 on 39617 degrees of freedom
## Multiple R-squared:  0.0225, Adjusted R-squared:  0.02185 
## F-statistic: 35.07 on 26 and 39617 DF,  p-value: < 2.2e-16

Results

As seen above, the three stepwise-type procedures do not give identical results. Forward selection and stepwise regression select very similar models here; stepwise regression drops kw_max_max and n_tokens_content, which forward selection retained. Backward elimination keeps noticeably more predictors (34, versus 28 for forward selection and 26 for stepwise regression), which results in a slightly higher adjusted \(R^2\) value. The residual standard errors are essentially the same, which shows how different combinations of predictors can have similar performance. It is valuable to perform all three of these procedures and pick a model based on the context of the problem.

Pros & Cons

Pros

  1. Subset selection is easy to implement and may work with categorical and continuous predictors.

  2. The coefficients assigned to the predictors selected using subset selection are easy to interpret, which is useful for inference problems.

  3. The three stepwise-type procedures can be used even when there are many predictors.

Cons

  1. When there are a large number of possible predictors, the time to run the best subsets procedure can become prohibitive.

  2. Results of the best subset and stepwise-type procedure may vary based on the model selection parameters used.

  3. If the number of observations is small, then a small change in the data may result in a significant change to the model.

  4. These methods only work for supervised learning problems.

Penalized Regression

The subset selection methods improve on least squares with an “all or nothing” approach: each predictor is either fully in the model or excluded entirely. The results from subset selection can vary based on the selection criterion used, and the methods are often not robust to data of all sizes, rendering subset selection unstable. Shrinkage methods are useful in cases where subset selection is unstable. Shrinkage methods shrink the coefficients closer to 0 to help improve model fit and predictive power by reducing the variance of the coefficient estimates. This is accomplished through penalized regression, in which a penalty is imposed on the size of the coefficients in the model. Penalized regression methods are available for linear and logistic regression, depending on the type of outcome we are trying to predict.

There are three well-known methods for penalized regression: ridge, lasso, and elastic net. Each method imposes the penalty through a tuning parameter called \(\lambda\), which is usually selected using cross-validation: the \(\lambda\) value that minimizes the cross-validation error is used in the final model. The primary difference among the three methods is the choice of the mixing parameter \(\alpha\). Ridge regression uses an \(\alpha\) value of 0 and lasso regression uses an \(\alpha\) value of 1. Elastic net regression provides the flexibility to choose an \(\alpha\) value between 0 and 1.
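In glmnet’s parameterization (for a linear model with \(n\) observations), all three methods minimize the same penalized least squares objective, differing only in \(\alpha\):

\[
\min_{\beta_0,\,\beta}\; \frac{1}{2n}\sum_{i=1}^n \left(y_i - \beta_0 - x_i^\top \beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 + \alpha\,\lVert \beta \rVert_1\right]
\]

With \(\alpha = 0\) the penalty is purely the squared \(\ell_2\) norm (ridge); with \(\alpha = 1\) it is purely the \(\ell_1\) norm (lasso); intermediate values mix the two (elastic net).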

Code Implementation

The code below demonstrates penalized regression using the ridge, lasso, and elastic net methods. Since penalized regression aims to improve the predictive power of regression models, we evaluate the predictive performance of each of the three methods. To generate predictions, we begin by splitting our dataset into train and test sets: the train data is used to build the models, and predictions are made on the test data. Two metrics, RMSE (root mean squared error) and \(R^2\), will be used to assess the performance of each model. The glmnet package in R is very efficient for fitting ridge, lasso, and elastic net models, for both penalized linear and logistic regression.

Import the libraries
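The chunk for this step is not shown in the rendered output; based on the functions used below (`%>%`, `cv.glmnet()`, and caret’s `RMSE()`/`R2()`), a plausible set of imports would be:

```r
library(dplyr)   # data manipulation (%>%, subset/select helpers)
library(glmnet)  # ridge, lasso, and elastic net models
library(caret)   # RMSE() and R2() performance metrics
```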

Exploratory analysis on the data

The summary of the dependent variable, shares, shows a wide range of values, from 1 to 843,300, with a mean of 3,395.

#Import the dataset 
news = read.csv("OnlineNewsPopularity.csv")

#Drop non-predictive variables from the dataset
news = subset(news, select = -c(url,timedelta))

summary(news$shares)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1     946    1400    3395    2800  843300
#Split the data into train and test data 
set.seed(123)
split = sort(sample(nrow(news), nrow(news)*.8)) #80% train samples 
train<-news[split,]
test<-news[-split,]

We will compute the penalized linear regression on the train data set.

#Specify the predictor and response variables 
#y is the response variable --> shares
y.train <- train$shares

#x is the matrix with the predictor variables 
x.train <- model.matrix(train$shares~., train)

#model.matrix() generates an intercept in the first column; remove it
x.train = x.train[,-1]

Ridge Regression

The setup of a ridge regression model begins with a 10-fold cross-validation model to determine the tuning parameter, \(\lambda\), built using the glmnet package in R. The parameters needed to compute the \(\lambda\) value include \(\alpha = 0\) and the number of cross-validation folds. Ridge regression can achieve a lower mean squared prediction error than least squares. When \(\lambda\) is 0, the ridge estimates and their variance are the same as those computed by least squares. As \(\lambda\) increases, the variance of the coefficient estimates decreases; as \(\lambda\) approaches \(\infty\), the variance approaches 0.

In order to compute ridge regression, we set the \(\alpha\) value to 0. The glmnet package fits the model and computes the tuning parameter, \(\lambda\).

# Find the best lambda using cross-validation
set.seed(202) 
cv_ridge <- cv.glmnet(x.train, y.train, alpha = 0)
# Display the best lambda value
cv_ridge$lambda.min   
## [1] 1196.618
# Fit the ridge regression model on the training data
ridge <- glmnet(x.train, y.train, alpha = 0, lambda = cv_ridge$lambda.min)
# Display regression coefficients
coef(ridge)
## 59 x 1 sparse Matrix of class "dgCMatrix"
##                                          s0
## (Intercept)                    5.315400e+02
## n_tokens_title                 6.447372e+01
## n_tokens_content              -1.233582e-01
## n_unique_tokens                3.095086e+00
## n_non_stop_words               1.136617e-01
## n_non_stop_unique_tokens       3.324189e+00
## num_hrefs                      3.110251e+01
## num_self_hrefs                -5.226171e+01
## num_imgs                       2.120434e+01
## num_videos                     2.009726e+01
## average_token_length          -2.146994e+02
## num_keywords                   3.525640e+01
## data_channel_is_lifestyle     -5.043507e+02
## data_channel_is_entertainment -1.121642e+03
## data_channel_is_bus           -6.985487e+02
## data_channel_is_socmed        -2.927150e+02
## data_channel_is_tech          -4.789197e+02
## data_channel_is_world         -4.646598e+02
## kw_min_min                     2.416062e+00
## kw_max_min                    -1.389527e-02
## kw_avg_min                    -6.159044e-02
## kw_min_max                    -2.003396e-03
## kw_max_max                    -3.021239e-04
## kw_avg_max                     6.100977e-04
## kw_min_avg                    -7.823197e-02
## kw_max_avg                    -4.934097e-02
## kw_avg_avg                     7.474319e-01
## self_reference_min_shares      7.636801e-03
## self_reference_max_shares      2.470708e-03
## self_reference_avg_sharess     1.821307e-03
## weekday_is_monday              3.037723e+02
## weekday_is_tuesday            -1.951195e+02
## weekday_is_wednesday          -3.873783e+01
## weekday_is_thursday           -1.127246e+02
## weekday_is_friday             -7.796023e+01
## weekday_is_saturday            3.776961e+02
## weekday_is_sunday              1.242207e+01
## is_weekend                     1.794889e+02
## LDA_00                         2.807948e+02
## LDA_01                        -3.192547e+02
## LDA_02                        -9.001340e+02
## LDA_03                         7.429552e+02
## LDA_04                         2.160582e+01
## global_subjectivity            2.548915e+03
## global_sentiment_polarity     -2.463874e+02
## global_rate_positive_words    -2.380978e+03
## global_rate_negative_words     7.879721e+02
## rate_positive_words           -1.522252e+02
## rate_negative_words           -3.742291e+02
## avg_positive_polarity         -7.207101e+02
## min_positive_polarity         -1.169008e+03
## max_positive_polarity          9.704761e+01
## avg_negative_polarity         -2.034620e+02
## min_negative_polarity         -3.017380e+02
## max_negative_polarity         -8.356904e+02
## title_subjectivity             4.435000e+01
## title_sentiment_polarity       1.168851e+02
## abs_title_subjectivity         8.025450e+02
## abs_title_sentiment_polarity   6.942200e+02

Next, we will use the ridge regression model constructed with the train data to make predictions on the test data.

# Make predictions on the test data
x.test <- model.matrix(test$shares ~., test)

#Drop the intercept column that model.matrix generates in the first column
x.test <- x.test[,-1]

predictions_ridge <- ridge %>% predict(x.test)

# Model performance metrics
RMSE_ridge = RMSE(predictions_ridge, test$shares)
Rsquare_ridge = R2(predictions_ridge, test$shares)
data.frame(RMSE = RMSE_ridge, R2 = Rsquare_ridge)
##       RMSE         s0
## 1 14274.39 0.01783234

Pros

  1. The ridge penalty decreases the variance of the coefficient estimates, which allows for better predictive performance. The decrease in variance does increase the bias, but a favorable bias-variance tradeoff improves overall model performance.
  2. Ridge regression is more computationally efficient than subset selection.

Cons

  1. The ridge penalty increases the bias of the model.
  2. The ridge penalty keeps all predictors in the model; unlike the subset selection methods, it never gives the user a model built from a subset of the variables.
  3. Models produced using a ridge penalty are difficult to interpret.

Lasso Regression

The setup of a lasso penalty is very similar to that of the ridge penalty. We set up a 10-fold cross-validation model to determine the tuning parameter \(\lambda\), again using the glmnet package in R. As mentioned earlier, the key difference between ridge and lasso is the choice of \(\alpha\): an \(\alpha\) value of 1 is used to construct a lasso model, along with the specified number of cross-validation folds. One advantage of lasso over ridge is that lasso can shrink coefficients completely to 0, whereas the ridge penalty only shrinks them close to 0. Shrinking coefficients to exactly 0 removes unhelpful predictor variables from the model, similar to subset selection, where only a subset of the predictor variables is retained.

In order to compute lasso regression, we set the \(\alpha\) value to 1. The glmnet package fits the model and computes the tuning parameter, \(\lambda\).

# Find the best lambda using cross-validation
set.seed(202) 
cv_lasso <- cv.glmnet(x.train, y.train, alpha = 1)
# Display the best lambda value
cv_lasso$lambda.min  
## [1] 28.29391
# Fit the lasso regression model on the training data
lasso <- glmnet(x.train, y.train, alpha = 1, lambda = cv_lasso$lambda.min)
# Display regression coefficients
coef(lasso)
## 59 x 1 sparse Matrix of class "dgCMatrix"
##                                          s0
## (Intercept)                   -6.020172e+02
## n_tokens_title                 6.122662e+01
## n_tokens_content               .           
## n_unique_tokens                7.075435e-02
## n_non_stop_words               .           
## n_non_stop_unique_tokens       .           
## num_hrefs                      3.094027e+01
## num_self_hrefs                -4.683322e+01
## num_imgs                       1.641107e+01
## num_videos                     1.457825e+01
## average_token_length          -2.242244e+02
## num_keywords                   7.569121e+00
## data_channel_is_lifestyle     -1.607200e+02
## data_channel_is_entertainment -9.394971e+02
## data_channel_is_bus           -1.611592e+02
## data_channel_is_socmed         .           
## data_channel_is_tech          -5.117892e+01
## data_channel_is_world          .           
## kw_min_min                     2.681926e+00
## kw_max_min                     .           
## kw_avg_min                    -9.819949e-02
## kw_min_max                    -1.734523e-03
## kw_max_max                    -3.127683e-04
## kw_avg_max                     .           
## kw_min_avg                    -2.493154e-01
## kw_max_avg                    -1.341927e-01
## kw_avg_avg                     1.288923e+00
## self_reference_min_shares      8.446495e-03
## self_reference_max_shares      2.621438e-03
## self_reference_avg_sharess     .           
## weekday_is_monday              3.422202e+02
## weekday_is_tuesday            -8.946535e+01
## weekday_is_wednesday           .           
## weekday_is_thursday            .           
## weekday_is_friday              .           
## weekday_is_saturday            3.261783e+02
## weekday_is_sunday              .           
## is_weekend                     1.982845e+02
## LDA_00                         3.424033e+01
## LDA_01                        -4.344960e+01
## LDA_02                        -7.586311e+02
## LDA_03                         7.852135e+02
## LDA_04                         .           
## global_subjectivity            2.263121e+03
## global_sentiment_polarity      .           
## global_rate_positive_words     .           
## global_rate_negative_words     .           
## rate_positive_words            .           
## rate_negative_words            .           
## avg_positive_polarity         -4.838425e+02
## min_positive_polarity         -9.829018e+02
## max_positive_polarity          .           
## avg_negative_polarity         -3.833825e+02
## min_negative_polarity         -1.155690e+02
## max_negative_polarity         -5.681276e+02
## title_subjectivity             .           
## title_sentiment_polarity       .           
## abs_title_subjectivity         6.493182e+02
## abs_title_sentiment_polarity   6.784563e+02

Next, we will use the lasso regression model constructed with the train data to make predictions on the test data.

# Make predictions on the test data
predictions_lasso <- lasso %>% predict(x.test)

# Model performance metrics
RMSE_lasso = RMSE(predictions_lasso, test$shares)
Rsquare_lasso = R2(predictions_lasso, test$shares)
data.frame(RMSE = RMSE_lasso, R2 = Rsquare_lasso)
##       RMSE         s0
## 1 14268.18 0.01866362

Pros

  1. The lasso penalty can remove unhelpful predictors from the model by setting their coefficients to 0, leaving only useful predictors to improve predictive power.

  2. Models generated using the lasso penalty have better interpretability compared to the ridge penalty.

  3. Lasso regression is also much more computationally efficient than the subset selection methods.

  4. Lasso regression, like ridge regression, exploits the bias-variance trade-off: by decreasing variance, it can improve model performance.

Cons

  1. Since the lasso penalty removes predictors entirely, it can introduce additional bias by discarding any information the excluded predictors carry.

From the description and results above, we can see that the ridge and lasso penalties differ only slightly and have similar performance; neither outperforms the other in all aspects.

Elastic Net Regression

The elastic net regression penalty is slightly different from the lasso and ridge penalties: here the \(\alpha\) value can fall anywhere between 0 and 1. An \(\alpha\) value between 0 and 1 can be chosen manually, or the caret package in R can automatically select the best \(\alpha\) and \(\lambda\) parameters, i.e. the combination that minimizes the error rate.
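For reference, the objective glmnet minimizes combines the two penalties. In glmnet's parameterization, the elastic net solves

\[ \min_{\beta_0,\,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\big(y_i - \beta_0 - x_i^{T}\beta\big)^2 \;+\; \lambda\Big[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 \;+\; \alpha\,\lVert\beta\rVert_1\Big], \]

which reduces to ridge regression when \(\alpha = 0\) and to the lasso when \(\alpha = 1\).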

Pros

  1. Elastic net avoids hard-coding the \(\alpha\) value and automatically selects the best combination of \(\alpha\) and \(\lambda\) values to fit the model. As shown below, the elastic net model attains the lowest RMSE of the three penalized models, indicating that it results in the least prediction error.

Cons

  1. The elastic net model is less computationally efficient than lasso and ridge regression, since the best \(\alpha\) must also be searched for.

Elastic net can be computed using two methods. The first method automatically finds the best \(\alpha\) and \(\lambda\) values.

#Define the model and find the best lambda using 10-fold cross validation 
set.seed(202)
elasticnet <- train(
  shares ~ ., data = train, method = "glmnet",
  trControl = trainControl("cv", number = 10), tuneLength = 10
)

#Compute coefficients for the best tuned model 
coef(elasticnet$finalModel, elasticnet$bestTune$lambda) 
## 59 x 1 sparse Matrix of class "dgCMatrix"
##                                           1
## (Intercept)                   -6.086767e+02
## n_tokens_title                 6.096168e+01
## n_tokens_content               .           
## n_unique_tokens                9.864408e-02
## n_non_stop_words               .           
## n_non_stop_unique_tokens       .           
## num_hrefs                      3.092007e+01
## num_self_hrefs                -4.685020e+01
## num_imgs                       1.641840e+01
## num_videos                     1.463544e+01
## average_token_length          -2.242787e+02
## num_keywords                   7.958360e+00
## data_channel_is_lifestyle     -1.455686e+02
## data_channel_is_entertainment -9.495347e+02
## data_channel_is_bus           -1.571203e+02
## data_channel_is_socmed         .           
## data_channel_is_tech          -3.980603e+01
## data_channel_is_world          .           
## kw_min_min                     2.709793e+00
## kw_max_min                     .           
## kw_avg_min                    -9.634882e-02
## kw_min_max                    -1.722391e-03
## kw_max_max                    -2.880619e-04
## kw_avg_max                     .           
## kw_min_avg                    -2.414621e-01
## kw_max_avg                    -1.310833e-01
## kw_avg_avg                     1.269401e+00
## self_reference_min_shares      8.432876e-03
## self_reference_max_shares      2.625916e-03
## self_reference_avg_sharess     .           
## weekday_is_monday              3.411827e+02
## weekday_is_tuesday            -8.906019e+01
## weekday_is_wednesday           .           
## weekday_is_thursday            .           
## weekday_is_friday              .           
## weekday_is_saturday            3.266023e+02
## weekday_is_sunday              .           
## is_weekend                     1.971661e+02
## LDA_00                         5.131292e+01
## LDA_01                        -6.873845e+00
## LDA_02                        -7.454948e+02
## LDA_03                         8.267474e+02
## LDA_04                         .           
## global_subjectivity            2.271835e+03
## global_sentiment_polarity      .           
## global_rate_positive_words     .           
## global_rate_negative_words     .           
## rate_positive_words            .           
## rate_negative_words            .           
## avg_positive_polarity         -4.850011e+02
## min_positive_polarity         -9.764213e+02
## max_positive_polarity          .           
## avg_negative_polarity         -2.842540e+02
## min_negative_polarity         -1.444689e+02
## max_negative_polarity         -6.330207e+02
## title_subjectivity             .           
## title_sentiment_polarity       .           
## abs_title_subjectivity         6.477365e+02
## abs_title_sentiment_polarity   6.781005e+02

Elastic net can also be computed using glmnet directly in R. Here we specify an \(\alpha\) value of our choice, and cross-validation selects the best \(\lambda\) value.

# Find the best lambda using cross-validation
set.seed(202)
cv_elasticnet = cv.glmnet(x = x.train, y=train$shares, alpha=0.8)
lambda.hat = cv_elasticnet$lambda.min
lambda.hat  
## [1] 35.36738
# Fit the elastic net regression model on the training data
elasticnet_glm <- glmnet(x.train, y.train, alpha = 0.8, lambda = cv_elasticnet$lambda.min)
# Display regression coefficients
coef(elasticnet_glm)
## 59 x 1 sparse Matrix of class "dgCMatrix"
##                                          s0
## (Intercept)                   -5.885660e+02
## n_tokens_title                 6.110030e+01
## n_tokens_content               .           
## n_unique_tokens                6.787211e-02
## n_non_stop_words               .           
## n_non_stop_unique_tokens       .           
## num_hrefs                      3.093699e+01
## num_self_hrefs                -4.684245e+01
## num_imgs                       1.642880e+01
## num_videos                     1.458389e+01
## average_token_length          -2.245860e+02
## num_keywords                   7.698989e+00
## data_channel_is_lifestyle     -1.598475e+02
## data_channel_is_entertainment -9.422223e+02
## data_channel_is_bus           -1.625094e+02
## data_channel_is_socmed         .           
## data_channel_is_tech          -5.241918e+01
## data_channel_is_world          .           
## kw_min_min                     2.681249e+00
## kw_max_min                     .           
## kw_avg_min                    -9.873371e-02
## kw_min_max                    -1.731488e-03
## kw_max_max                    -3.063569e-04
## kw_avg_max                     .           
## kw_min_avg                    -2.456792e-01
## kw_max_avg                    -1.326012e-01
## kw_avg_avg                     1.279411e+00
## self_reference_min_shares      8.445752e-03
## self_reference_max_shares      2.621808e-03
## self_reference_avg_sharess     .           
## weekday_is_monday              3.417161e+02
## weekday_is_tuesday            -8.943150e+01
## weekday_is_wednesday           .           
## weekday_is_thursday            .           
## weekday_is_friday              .           
## weekday_is_saturday            3.262061e+02
## weekday_is_sunday              .           
## is_weekend                     1.981741e+02
## LDA_00                         3.526992e+01
## LDA_01                        -4.001172e+01
## LDA_02                        -7.628426e+02
## LDA_03                         7.929370e+02
## LDA_04                         .           
## global_subjectivity            2.262254e+03
## global_sentiment_polarity      .           
## global_rate_positive_words     .           
## global_rate_negative_words     .           
## rate_positive_words            .           
## rate_negative_words            .           
## avg_positive_polarity         -4.824727e+02
## min_positive_polarity         -9.811835e+02
## max_positive_polarity          .           
## avg_negative_polarity         -3.842968e+02
## min_negative_polarity         -1.153100e+02
## max_negative_polarity         -5.683147e+02
## title_subjectivity             .           
## title_sentiment_polarity       .           
## abs_title_subjectivity         6.486684e+02
## abs_title_sentiment_polarity   6.780768e+02

Next, we will use the elastic net regression model constructed with the train data to make predictions on the test data.

#Predictions on the test data 
predictions_elasticnet <- elasticnet %>% predict(x.test)

# Evaluate the model performance using the RMSE and R2 metrics from
# caret's cross-validation results (resampling estimates on the training data)
RMSE_1 = min(elasticnet$results$RMSE)  
Rsquare_1 = min(elasticnet$results$Rsquared)
data.frame(RMSE = RMSE_1, R2 = Rsquare_1)
##       RMSE         R2
## 1 9665.946 0.02007253

Performance for all 3 models

As we can see, all 3 models are easy to fit and generate predictions with using the glmnet package in R. In terms of model performance, the best RMSE from the elastic net model is lower than those of the ridge and lasso models. The ridge and lasso models have very close RMSE and \(R^2\) values, whereas the elastic net method outperforms them on RMSE. Penalized regression is easy to compute and is most helpful for large, high-dimensional datasets. It is recommended to fit all 3 models and to proceed with the model suited to the problem that produces the best predictions with the lowest RMSE.
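To make this comparison concrete, we can gather the metrics printed above into one table (values copied from the outputs earlier in this section; note that the elastic net RMSE is a cross-validation estimate rather than a test-set value):

```r
# Comparison of the three penalized models, using the metrics printed above
perf <- data.frame(
  model = c("ridge", "lasso", "elastic net"),
  RMSE  = c(14274.39, 14268.18, 9665.946),
  R2    = c(0.01783234, 0.01866362, 0.02007253)
)

# Identify the model with the lowest RMSE
perf$model[which.min(perf$RMSE)]
```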

Dimension Reduction

In the methods described so far, we have focused on feature selection: eliminating features that we decide are not significant enough for our model. However, by entirely dropping these features - however insignificant they may be - we lose any information they might contribute. Instead, we can incorporate all variables strategically using dimension reduction. Dimension reduction entails projecting the data into an \(M\)-dimensional subspace, where each of the \(M\) new features is constructed as a linear combination of the \(p\) original predictors. The \(M\) transformed predictors are then used to fit a least squares linear regression model. In order to benefit from this approach, \(M < p\) must hold; if \(M = p\), the resulting model is equivalent to a least squares fit using all of the original predictors. This approach is especially beneficial when \(p\) (the number of predictors) is large relative to \(n\) (the number of observations), as using \(M < p\) can substantially reduce the variance of the model.

Dimension reduction methods generally follow two steps:
1. Obtain the transformed predictors \(Z_1\), \(Z_2\),…\(Z_M\)
2. Fit the model using these predictors
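In symbols, the two steps above are: construct the transformed predictors

\[ Z_m = \sum_{j=1}^{p} \phi_{jm} X_j, \qquad m = 1, \dots, M, \]

and then fit the linear model

\[ y_i = \theta_0 + \sum_{m=1}^{M} \theta_m z_{im} + \epsilon_i, \qquad i = 1, \dots, n, \]

by least squares. The weights \(\phi_{jm}\) define each transformed predictor; PCA and PLS differ only in how these weights are chosen.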

In the following sections, we will discuss two popular methods for dimension reduction: Principal Component Analysis (PCA) and Partial Least Squares (PLS).

Principal Components Analysis (PCA)

Principal Components Analysis is a dimension reduction technique in which variables named principal components are constructed as linear combinations of our original variables. There are as many principal components created as there are original variables in our data. However, in order to attain a model with reduced dimensions, we will aim to drop a number of these principal components. To determine which principal components can be excluded from our model, we will rank the principal components by their importance in predicting the response variable. Then, only the most important principal components are used in the linear model.

We will likely want to use this method when we seek to reduce the number of variables but do not know which ones to drop. However, we will see that one drawback of this method is reduced interpretability: we are no longer using our original predictors to predict the response, but transformed predictors that each combine all of the original predictors.

Before we dive into the algorithm and steps behind PCA, it will be helpful to review how and why we use variance to rank principal components by importance.

Constructing Principal Components

Principal components represent the directions of the data that explain a maximal amount of variance. In understanding this method, it’s important to recognize that variance explained is equivalent to information captured in a model. The more variance in the data that we can explain through a feature, the better.

The image below will help explain how this concept ties into principal component analysis.

(Source: Z. Jaadi, "A Step-by-Step Explanation of Principal Component Analysis.")

The goal of principal component analysis is to compress as much information, or variance, as possible into the first components; thus the first principal component accounts for the largest share of variance in the dataset.

In the scatterplot above, we aim to construct a principal component, or line, that maximizes this variance - that is, the line along which the red dots are most spread out. The larger the variance represented by a line (the greater the dispersion of points along it), the more information the line holds.

Now that we understand how these principal components contribute to our model, let’s walk through the steps taken to construct the principal components.

Algorithm Behind the Method:

  1. Center and standardize our original predictor variable matrix, \(X\).
    • This method is sensitive to the variances of the predictor variables, so this step is necessary to ensure each variable contributes equally to the model.
  2. Compute the covariance matrix by multiplying the transpose of this standardized matrix by itself.
    • Thus, where \(X\) has been standardized: \(X^TX\) (which is proportional to the sample covariance matrix).
  3. Compute the eigenvectors and eigenvalues of the covariance matrix, \(X^TX\).
    • Let’s pause here for a refresher on linear algebra:
      • Eigenvectors represent directions and eigenvalues represent magnitudes (or, in this scenario, importance). Eigenvectors are stored as the columns of a matrix, while eigenvalues are stored in a diagonal matrix, with the eigenvalues on the diagonal and zeros everywhere else. Each eigenvalue on the diagonal is associated with the corresponding column of the eigenvector matrix; for example, the eigenvalue stored in the second column of its matrix corresponds to the second column of the eigenvector matrix.
      • Luckily, there are functions in R that will compute these values for us, but it’s important to understand the eigenvectors and eigenvalues they output.

  4. Sort the eigenvalues from largest to smallest.
    • Since eigenvalues correspond to their eigenvectors, the eigenvectors are sorted in the same order (put another way, the columns of the eigenvector matrix are reordered according to the reordering of the eigenvalues).
  5. Multiply this sorted eigenvector matrix by the standardized version of \(X\) from step 1.
    • As a result, we now have a standardized version of \(X\) with weights determined by the eigenvectors.
  6. With this new matrix, determine how many features (principal components) to keep. This can be done a few ways:
    1. Arbitrarily determine how many dimensions to keep:
      • This is not recommended, but at times may be appropriate for your specific analysis or problem.
    2. Set a threshold for proportion of variance explained:
      • As explained earlier, the importance of a principal component or eigenvector is tied to variance, and each eigenvalue measures the importance of its corresponding eigenvector (as we saw when sorting the eigenvalues and then the eigenvectors). We can therefore measure the importance of, or variance explained by, a set of principal components with the proportion of variance: the sum of the eigenvalues kept divided by the sum of all eigenvalues.
      • With this metric, we can pick a threshold for the proportion of variance explained and include principal components until that threshold is reached.
    3. Use a scree plot:
      • Similar to the method above, we use the proportion of variance to determine how many principal components to keep.
      • We calculate the proportion of variance for each principal component, sort the principal components by proportion of variance explained (though they should already be sorted in order of importance), then plot the proportion of variance explained by each principal component.
      • The resulting line plot has an elbow marking the largest drop in proportion of variance explained, and we keep the principal components up until this drop. In the scree plot below, there is a considerable drop in proportion of variance explained from PC1 to PC2. We would therefore identify PC2 as the elbow, keep only the first component (PC1), and drop the rest. As you can see, a disadvantage of this method is that it is relatively subjective.
  7. Use this new feature matrix, with the principal components/eigenvectors we chose to keep, as the new \(X\) in our linear regression model against the untransformed \(Y\) - this type of regression using PCA is known as Principal Components Regression (PCR).

    • Note: The new variables resulting from PCA are uncorrelated with one another, which conveniently removes any multicollinearity among the predictors in the linear model.
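To make the algorithm concrete, the steps above can be carried out by hand in base R. This is a minimal sketch on a built-in dataset (a few columns of mtcars), not the tutorial's news data:

```r
# Step 1: center and standardize the predictor matrix
X <- scale(as.matrix(mtcars[, c("mpg", "disp", "hp", "wt")]))

# Step 2: covariance matrix of the standardized data, t(X) %*% X / (n - 1)
S <- crossprod(X) / (nrow(X) - 1)

# Steps 3-4: eigenvectors and eigenvalues; eigen() returns them already
# sorted from largest to smallest eigenvalue
eig <- eigen(S)

# Step 5: project the standardized data onto the eigenvectors to get the
# principal component scores
Z <- X %*% eig$vectors

# Step 6: proportion of variance explained by each component, used to
# decide how many components to keep
pve <- eig$values / sum(eig$values)
round(cumsum(pve), 3)
```

Up to column signs, the scores in Z match the output of prcomp(..., scale. = TRUE), and pve matches the Proportion of Variance row in its summary.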

Code Implementation

There are a couple of libraries you can use for PCR - one is pls which also can be used for PLS (covered next). We’ll first show how to run PCR using this library:

library(pls)

Run PCR with cross-validation:

pcr.fit = pcr(shares ~ ., data = news, scale = TRUE, validation = "CV")

In the above code, scale is set to TRUE in order to standardize each predictor. validation is set to CV in order to compute ten-fold cross-validation error for each possible value of M (the number of principal components used). Having the cross-validation error will give us another metric to judge principal components by.

Summary:

summary(pcr.fit)
## Data:    X dimension: 39644 58 
##  Y dimension: 39644 1
## Fit method: svdpc
## Number of components considered: 58
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           11627    11629    11604    12222    12482    12236    12259
## adjCV        11627    11628    11603    12158    12394    12171    12191
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       12189    12402    12417     12406     12397     12703     13507
## adjCV    12127    12320    12334     12324     12316     12594     13325
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps  20 comps
## CV        13491     13551     13529     13485     13596     13556     13435
## adjCV     13309     13363     13343     13303     13405     13367     13257
##        21 comps  22 comps  23 comps  24 comps  25 comps  26 comps  27 comps
## CV        13422     13546     13597     13745     13267     12597     12668
## adjCV     13245     13358     13405     13541     13103     12494     12559
##        28 comps  29 comps  30 comps  31 comps  32 comps  33 comps  34 comps
## CV        12742     12697     12108     12096     12140     12846     12849
## adjCV     12626     12585     12051     12040     12080     12721     12723
##        35 comps  36 comps  37 comps  38 comps  39 comps  40 comps  41 comps
## CV        12929     12418     13047     12271     12332     12322     12354
## adjCV     12796     12331     12902     12198     12253     12244     12273
##        42 comps  43 comps  44 comps  45 comps  46 comps  47 comps  48 comps
## CV        11531     11534     11579     11579     11570     11573     11535
## adjCV     11530     11533     11573     11573     11565     11569     11533
##        49 comps  50 comps  51 comps  52 comps  53 comps  54 comps  55 comps
## CV        11595     11569     12045     11909     11881     15445     11519
## adjCV     11587     11563     11991     11869     11844     15098     11517
##         56 comps   57 comps   58 comps
## CV     1.407e+12  1.411e+12  1.411e+12
## adjCV  1.334e+12  1.339e+12  1.339e+12
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       8.27073  15.1673   21.356   26.531   31.220   35.647   39.828    43.76
## shares  0.09982   0.4832    1.129    1.129    1.226    1.228    1.245     1.28
##         9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X         47.39    50.908    54.359     57.60    60.380    62.757    65.107
## shares     1.28     1.331     1.331      1.34     1.456     1.587     1.615
##         16 comps  17 comps  18 comps  19 comps  20 comps  21 comps  22 comps
## X         67.237    69.355    71.443    73.498    75.520    77.467    79.338
## shares     1.623     1.624     1.634     1.647     1.664     1.664     1.664
##         23 comps  24 comps  25 comps  26 comps  27 comps  28 comps  29 comps
## X          81.14    82.701    84.202    85.594    86.838    88.005    89.131
## shares      1.67     1.672     1.739     1.756     1.763     1.763     1.785
##         30 comps  31 comps  32 comps  33 comps  34 comps  35 comps  36 comps
## X         90.185    91.174    92.136    93.058    93.869    94.663    95.421
## shares     1.785     1.836     1.856     1.858     1.922     1.922     1.943
##         37 comps  38 comps  39 comps  40 comps  41 comps  42 comps  43 comps
## X         96.050      96.6    97.118    97.539    97.954    98.337    98.667
## shares     1.964       2.0     2.002     2.004     2.006     2.009     2.026
##         44 comps  45 comps  46 comps  47 comps  48 comps  49 comps  50 comps
## X         98.983    99.206    99.375    99.513    99.643    99.745    99.836
## shares     2.028     2.029     2.029     2.032     2.195     2.198     2.206
##         51 comps  52 comps  53 comps  54 comps  55 comps  56 comps  57 comps
## X         99.901    99.955    99.999   100.000    100.00    100.00    100.00
## shares     2.284     2.285     2.293     2.301      2.31      2.31      2.31
##         58 comps
## X         100.00
## shares      2.31

From this summary, we see that the smallest cross-validated RMSE occurs at 55 components (\(M = 55\)), at which point essentially 100% of the variance in the predictors is explained (but only about 2.3% of the variance in shares).

More info on the pls package is available here.

Next, we can use the prcomp() function from the core stats package to carry out the PCA step. The difference in setting up pcr() vs. prcomp() is that prcomp() takes the data without a response variable. Therefore, we create an X from our data by removing our response variable, shares.

X = news %>% dplyr::select(-shares)

prcomp.fit = prcomp(X, scale = TRUE)

summary(prcomp.fit)
## Importance of components:
##                            PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.19021 2.00001 1.89462 1.73250 1.64898 1.60249 1.55731
## Proportion of Variance 0.08271 0.06897 0.06189 0.05175 0.04688 0.04428 0.04181
## Cumulative Proportion  0.08271 0.15167 0.21356 0.26531 0.31220 0.35647 0.39828
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.50916 1.45274 1.42758 1.41471 1.37207 1.26879 1.17421
## Proportion of Variance 0.03927 0.03639 0.03514 0.03451 0.03246 0.02776 0.02377
## Cumulative Proportion  0.43755 0.47394 0.50908 0.54359 0.57604 0.60380 0.62757
##                          PC15   PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     1.1675 1.1115 1.10839 1.10049 1.09180 1.08293 1.06248
## Proportion of Variance 0.0235 0.0213 0.02118 0.02088 0.02055 0.02022 0.01946
## Cumulative Proportion  0.6511 0.6724 0.69355 0.71443 0.73498 0.75520 0.77467
##                           PC22    PC23    PC24    PC25    PC26    PC27    PC28
## Standard deviation     1.04181 1.02235 0.95146 0.93317 0.89832 0.84948 0.82282
## Proportion of Variance 0.01871 0.01802 0.01561 0.01501 0.01391 0.01244 0.01167
## Cumulative Proportion  0.79338 0.81140 0.82701 0.84202 0.85594 0.86838 0.88005
##                           PC29    PC30    PC31    PC32    PC33    PC34    PC35
## Standard deviation     0.80809 0.78173 0.75746 0.74714 0.73103 0.68603 0.67857
## Proportion of Variance 0.01126 0.01054 0.00989 0.00962 0.00921 0.00811 0.00794
## Cumulative Proportion  0.89131 0.90185 0.91174 0.92136 0.93058 0.93869 0.94663
##                           PC36    PC37   PC38    PC39    PC40    PC41    PC42
## Standard deviation     0.66318 0.60379 0.5647 0.54828 0.49432 0.49033 0.47134
## Proportion of Variance 0.00758 0.00629 0.0055 0.00518 0.00421 0.00415 0.00383
## Cumulative Proportion  0.95421 0.96050 0.9660 0.97118 0.97539 0.97954 0.98337
##                          PC43    PC44    PC45    PC46    PC47   PC48    PC49
## Standard deviation     0.4377 0.42836 0.35967 0.31239 0.28277 0.2748 0.24416
## Proportion of Variance 0.0033 0.00316 0.00223 0.00168 0.00138 0.0013 0.00103
## Cumulative Proportion  0.9867 0.98983 0.99206 0.99375 0.99513 0.9964 0.99745
##                           PC50    PC51    PC52    PC53    PC54     PC55
## Standard deviation     0.22955 0.19366 0.17714 0.16042 0.01593 0.006814
## Proportion of Variance 0.00091 0.00065 0.00054 0.00044 0.00000 0.000000
## Cumulative Proportion  0.99836 0.99901 0.99955 0.99999 1.00000 1.000000
##                             PC56      PC57      PC58
## Standard deviation     1.554e-05 5.273e-15 4.019e-15
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00

As with pcr(), we get a summary table covering all PCs; here, however, the Proportion of Variance and Cumulative Proportion are reported for each component directly.

Next, we visualize the eigenvalues with a scree plot using fviz_eig from the factoextra package, which shows the percentage of variance explained by each principal component. We pass ncp = 58 in order to visualize all PCs; if you would like to view fewer, change this input.

library(factoextra)
fviz_eig(prcomp.fit, ncp = 58)

At this point, we can either choose a threshold for proportion of variance explained or look for an elbow in the scree plot. Say we choose a threshold of 85% of variance explained: we would then need to include components up to and including PC26 to meet this threshold, and would therefore keep PC1 through PC26 in the model. In contrast, if we use the scree plot to find the elbow, there is arguably a large drop around PC5, at which point only about 31% of the variance is explained. Even if we continue to the next drop and elbow at PC12, only about 58% of the variance is explained.
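The threshold rule above is easy to automate. Here is a small helper of our own (pcs_needed is a hypothetical name, not part of any package) that returns the number of components needed to reach a cumulative proportion-of-variance target; we demonstrate it on a built-in dataset, since it only needs a prcomp fit:

```r
# Hypothetical helper: smallest number of PCs whose cumulative proportion
# of variance meets the threshold (85% by default, as in the text)
pcs_needed <- function(fit, threshold = 0.85) {
  pve <- fit$sdev^2 / sum(fit$sdev^2)  # proportion of variance per PC
  which(cumsum(pve) >= threshold)[1]   # first PC at or past the threshold
}

fit <- prcomp(USArrests, scale = TRUE)
pcs_needed(fit)
```

With the tutorial's prcomp.fit, pcs_needed(prcomp.fit) would return 26, matching the PC26 cutoff found above.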

More info on prcomp() is available here.

More info on the factoextra package is available here.

Partial Least Squares (PLS)

Partial least squares (PLS) is very similar to principal components regression (PCR), in that both generate a new set of features that are linear combinations of the original features. However, unlike PCR, PLS incorporates information from the response variable when creating these new features. In doing so, PLS attempts to identify directions that explain the most variance in both the features and the response, rather than just explaining the variance in the features. Because PCR only considers the variance in the features, it may not be best for prediction, particularly if there is some variance in the features unrelated to the variance in the response.

PLS can be thought of as a supervised alternative to PCR’s unsupervised approach. This is not to say that PCR cannot be used for supervised learning problems (it can be used for either supervised or unsupervised problems), but instead describes the method of generating the new features. PLS, on the other hand, can only be used for supervised problems, since the response variable is used in the creation of the new features.

Algorithm

PLS generates new features \(Z\) that are linear combinations of the original features \(X\). This means that the \(m\)th new feature can be written as \(Z_m = \sum_{j = 1}^p \phi_{jm}X_j\), where \(p\) is the total number of original features.

This process for creating these new features is as follows:

  1. Standardize all features.

  2. Initialize \(X^{(0)} = X\) and \(m=1\).

  3. Compute the first direction \(Z_1\) by setting each \(\phi_{j1}\) equal to the coefficient from the simple ordinary least squares regression of \(Y\) on \(X_j\). In doing so, the \(X_j\)’s most highly correlated with \(Y\) receive the highest weights in \(Z_1\).

  4. Orthogonalize \(X\) with respect to \(Z_1\) by regressing each variable \(X_j\) on \(Z_1\) and taking the residuals. Denote these orthogonalized values \(X^{(1)}\). This means that \(X^{(1)}\) is essentially the information not already explained by the first direction.

  5. Repeat steps 3-4 to calculate each \(Z_m\) for \(m = 2, ..., p\), using the orthogonalized \(X^{(m-1)}\) output from the previous step instead of \(X\) to generate \(Z_m\). So \(X^{(1)}\) is used to calculate \(Z_2\), \(X^{(2)}\) is used to calculate \(Z_3\), and so on.

  6. Use ordinary least squares to regress \(Y\) on \(Z_1, ..., Z_m\) for some \(m \le p\). Generally use cross-validation to choose the best value of \(m\).

Code Implementation

To implement PLS in R, we can use the pls package. After loading the package, the basic syntax for fitting PLS regression is very similar to that of linear regression with the lm function, though in this case we will use the plsr function. The first argument is a formula to tell the function what the response and predictor variables are. In this case, we’re using online news data with shares as the response and all other variables as predictors. To avoid having to type out 58 variables in the formula, we can instead use shares ~ . to indicate that all columns besides shares should be treated as predictor variables. Additionally, to standardize the predictors, we can specify scale = TRUE and center = TRUE.

#load the pls library
library(pls)

#fit the pls regression
pls_fit <- plsr(shares ~ ., data = news, scale = TRUE, center = TRUE)

summary(pls_fit)
## Data:    X dimension: 39644 58 
##  Y dimension: 39644 1
## Fit method: kernelpls
## Number of components considered: 58
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X         6.381   10.905   16.268   19.975   24.040   28.063   30.560   33.075
## shares    1.624    1.896    2.001    2.075    2.121    2.154    2.202    2.238
##         9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X        35.176    37.555     39.87    42.113    45.369    47.846    50.660
## shares    2.263     2.274      2.28     2.283     2.285     2.287     2.289
##         16 comps  17 comps  18 comps  19 comps  20 comps  21 comps  22 comps
## X          53.01    54.760    57.655    59.601    62.131    63.598    65.113
## shares      2.29     2.292     2.292     2.293     2.293     2.293     2.293
##         23 comps  24 comps  25 comps  26 comps  27 comps  28 comps  29 comps
## X         67.478    69.166    70.471    71.769    73.544    75.906    77.257
## shares     2.293     2.293     2.293     2.293     2.293     2.293     2.294
##         30 comps  31 comps  32 comps  33 comps  34 comps  35 comps  36 comps
## X         78.732    79.454    80.527    81.817    82.824    83.744    84.837
## shares     2.294     2.294     2.294     2.295     2.296     2.297     2.298
##         37 comps  38 comps  39 comps  40 comps  41 comps  42 comps  43 comps
## X         85.578    86.978    87.800    88.806    90.166    90.601    92.263
## shares     2.301     2.303     2.304     2.304     2.304     2.304     2.304
##         44 comps  45 comps  46 comps  47 comps  48 comps  49 comps  50 comps
## X         93.097    94.168    95.511    96.111    96.761    97.366    97.875
## shares     2.304     2.304     2.304     2.304     2.304     2.304     2.304
##         51 comps  52 comps  53 comps  54 comps  55 comps  56 comps  57 comps
## X         98.610    99.366    99.606    99.922    100.00    100.00    100.00
## shares     2.304     2.305     2.305     2.306      2.31      2.31      2.31
##         58 comps
## X         100.00
## shares      2.31

The summary of the fit shows the percentage of variance explained as each additional component is added, both for the predictors (the X row) and for the response. In this example, we see that five components explain about 24% of the variance in the predictors, and 10 components explain about 38%.

We can also make predictions for new data using our PLS model. As with fitting the model, this process is very similar to ordinary least squares regression. We will use the predict function, but will need to include an additional parameter ncomp to specify the number of components to include.

More info on the pls package is available here.

#predict for given number of components (3 in this case)
predict(pls_fit, ncomp = 3, newdata = news[39640:39644,])
## , , 3 comps
## 
##          shares
## 39640 3631.6429
## 39641 5093.6567
## 39642 5594.3732
## 39643  799.7493
## 39644 1139.6305

While the percentage of variance explained gives us some information about how many components to include, best practice is to choose this number using cross-validation. One manual approach would be a double for loop iterating over subsets of the data and numbers of components, making predictions and calculating an error measure (generally root mean squared error for a numeric response) for each combination, then choosing the number of components with the smallest RMSE. However, the caret package can simplify this process considerably using the train function. We need to include the same parameters as in the call to plsr, along with a few additional ones. We specify method = "pls" since train can handle many different model types, trControl = trainControl("cv", number = 10) to request 10-fold cross-validation, and tuneLength = 10 to test values from 1 to 10 for the ncomp parameter.

library(caret)

#set seed for reproducibility, since CV involves random partitions
set.seed(300)
 
pls_cv <- train(shares ~ ., data = news, scale = TRUE, center = TRUE, 
               method = "pls", trControl = trainControl("cv", number = 10), 
               tuneLength = 10)

plot(pls_cv)

pls_cv$bestTune
##   ncomp
## 2     2

From cross-validation, we see that two components is the best choice, and that including more than that may lead to overfitting.

More info on the caret package can be found here.
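For comparison with caret, the manual double for loop described earlier might look like the following sketch. The synthetic data frame, the fold assignments, and the max_comp value are all invented for illustration; with the news data you would substitute shares ~ . and the real data frame:

```r
library(pls)

# Hedged sketch of manual cross-validation over the number of components.
set.seed(1)
n <- 200
dat <- data.frame(matrix(rnorm(n * 6), n, 6))  # synthetic predictors X1..X6
dat$y <- dat$X1 + rnorm(n)                     # synthetic response

folds <- sample(rep(1:5, length.out = n))      # random 5-fold partition
max_comp <- 4
rmse <- matrix(NA, nrow = 5, ncol = max_comp)

for (k in 1:5) {                               # outer loop: held-out fold
  fit <- plsr(y ~ ., data = dat[folds != k, ], scale = TRUE)
  for (m in 1:max_comp) {                      # inner loop: n. of components
    pred <- predict(fit, ncomp = m, newdata = dat[folds == k, ])
    rmse[k, m] <- sqrt(mean((dat$y[folds == k] - pred)^2))
  }
}

which.min(colMeans(rmse))  # number of components with lowest average RMSE
```

This is exactly the bookkeeping that train automates, along with refitting the final model on the full data at the chosen ncomp.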

Pros & Cons

PLS can perform well for prediction when there are many features, even when multicollinearity is present, violating the assumptions for ordinary least squares (OLS). Multicollinearity occurs when some predictor variables are highly correlated with some combination of other predictor variables, and occurs increasingly often as the number of predictor variables grows. Therefore PLS can be useful for prediction with high-dimensional data (when the number of predictors \(p\) is large relative to the number of observations \(n\)). However, PLS is less interpretable than OLS in understanding the underlying relationship between features and response, since it involves the creation of new features.

PLS generally has lower bias but higher variance than PCR.

PLS can have minor instability in its estimates because it tends to shrink low-variance directions but sometimes inflates high-variance directions. Ridge regression shrinks all directions, but especially low-variance directions, while PCR discards low-variance directions.

PLS, PCR, and ridge regression typically perform very similarly, but for minimizing prediction error, ridge regression is generally preferred because it shrinks smoothly as opposed to in discrete steps.

Conclusion

In summary, it is important to consider the number of predictors used to fit an OLS model. Often, not all predictors are necessary, and including them all can lead to overfitting. We have outlined several methods for removing or transforming these predictors, and unfortunately no single method is always best - the right choice depends on the context of your data and problem. Ridge regression does tend to outperform the others most frequently, but it is important to use discretion and due diligence: test or consider a couple of methods, then compare the resulting models using, for example, test MSE.

We hope you’ve enjoyed our tutorial!

References

[1] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning with Applications in R. New York: Springer, 2013.

[2] T. Hastie, J. Friedman, and R. Tibshirani, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. New York: Springer, 2009.

[3] R. Merrett, “Online News Popularity.” Available: https://code.datasciencedojo.com/datasciencedojo/datasets/tree/master/Online News Popularity [Accessed: 03-Dec-2020].

[4] M. Brems, “A One-Stop Shop for Principal Component Analysis.” Available: https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c. [Accessed: 04-Dec-2020].

[5] D. Montgomery, E. Peck, and G. Vining, Introduction To Linear Regression Analysis. 5th ed. Wiley, 2012.

[6] Z. Jaadi, “A Step-by-Step Explanation of Principal Component Analysis.” Available: https://builtin.com/data-science/step-step-explanation-principal-component-analysis. [Accessed: 05-Dec-2020].

[7] R. Tobias, “An Introduction to Partial Least Squares Regression.” Available: https://stats.idre.ucla.edu/wp-content/uploads/2016/02/pls.pdf. [Accessed: 03-Dec-2020].

[8] Kassambara, “Principal Component and Partial Least Squares Regression Essentials,” STHDA, 11-Mar-2018. [Online]. Available: http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/152-principal-component-and-partial-least-squares-regression-essentials/. [Accessed: 07-Dec-2020].

[9] B.-H. Mevik and R. Wehrens, “Introduction to the pls Package,” 04-Aug-2020. [Online]. Available: https://cran.r-project.org/web/packages/pls/vignettes/pls-manual.pdf. [Accessed: 07-Dec-2020].

[10] M. Kuhn, “The caret Package,” 27-Mar-2019. [Online]. Available: http://topepo.github.io/caret/index.html. [Accessed: 07-Dec-2020].

[11] M. Porter, “Data Mining Lecture.” 10-Sept-2020, Charlottesville. https://mdporter.github.io/SYS6018/lectures/03-penalized.pdf

[12] J. Friedman et al., glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models. 2020. https://CRAN.R-project.org/package=glmnet

[13] M. Porter, “Supervised Learning II Lecture.” 01-Sept-2020, Charlottesville. https://mdporter.github.io/SYS6018/lectures/03-penalized.pdf

[14] G. Elumalai, “Pros and cons of common Machine Learning algorithms,” Medium, Nov. 19, 2019. https://medium.com/@gokul.elumalai05/pros-and-cons-of-common-machine-learning-algorithms-45e05423264f [Accessed 07-Dec-2020].

[15] “Penalized Regression Essentials: Ridge, Lasso & Elastic Net - Articles - STHDA.” http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/153-penalized-regression-essentials-ridge-lasso-elastic-net/ [Accessed 07-Dec-2020].

[16] A. Kassambara et al., factoextra: Extract and Visualize the Results of Multivariate Data Analyses. 2020. https://CRAN.R-project.org/package=factoextra